ABSTRACT

An explicit formulation of the concept of non-informative prior
distribution  over  a  finite  number  of  possibilities  is  given.  Numerical 
examples  show  that  the  formulation  leads  to  non-trivial  results.  An 
information  inequality  is  established  to  assure  the  validity  of  numerical 
results.  The  relation  of  the  present  work  to  other  works  on  the  same  subject 
is  briefly  discussed. 
SIGNIFICANCE AND EXPLANATION


The concept of non-informative prior distribution has been useful in
developing Bayesian procedures for practical applications. However, rigorous
analysis of the concept in the case of a finite number of possible alternatives
has never been sufficiently developed. In this paper a new definition, the
minimum  information  prior  distribution,  is  introduced  based  on  the  predictive 
point  of  view.  The  characteristic  of  the  minimum  information  prior 
distribution  is  analyzed  numerically  and  non-trivial  examples  of  determination 
of  prior  probability  distribution  over  a  finite  number  of  possibilities  are 
reported. 
ON MINIMUM INFORMATION PRIOR DISTRIBUTIONS


Hirotugu  Akaike* 


1.  INTRODUCTION 

In  a  practical  application  of  the  Bayes  procedure  the  available  prior 
information  is  not  usually  sufficient  to  completely  specify  the  prior 
distribution.  This  often  leads  to  the  consideration  of  another  prior 
distribution,  the  hyperprior  distribution,  over  a  set  of  possible  prior 
distributions.  The  process  may  then  be  repeated  indefinitely  by  considering  a 
prior  distribution  over  a  set  of  possible  prior  distributions,  until  we  come 
to  the  point  where  no  more  information  is  available  to  continue  the  process. 
The  concept  of  non-informative  or  ignorance  prior  distribution  has  been 
developed  to  serve  in  this  type  of  situation. 

The  ignorance  prior  distribution  developed  by  Jeffreys  (1946)  is  well- 
known.  However,  its  definition  is  based  on  the  concept  of  invariance  of  the 
distribution  by  the  transformation  of  the  parameter  and  its  application  is 
limited  to  the  case  where  the  family  of  possible  data  distributions  is 
continuously  parametrized.  Lindley  (1956)  applied  the  Shannon  entropy  to 
develop  an  information  theoretic  analysis  of  the  structure  of  Bayesian 
modeling.  This  work  prompted  the  works  by  Zellner  (1977)  and  Bernardo  (1979) 
on  the  definition  of  the  least  informative  prior  distribution  based  on  some 
definitions of the amount of information. For an extensive reference on the
literature on non-informative prior distributions readers are referred to
Bernardo (1979).

In  the  present  paper  we  consider  the  basic  problem  of  specifying  a  prior 
distribution  over  a  finite  number  of  data  distributions  when  no  further  prior 
information is available. Conventionally the uniform distribution which
allocates equal probability to each data distribution is considered to be a
reasonable choice in such a situation; see, for example, Cox and Hinkley
(1974, p. 376). The analysis of Bernardo (1979) also leads to this prior
distribution. In our approach we define the minimum information prior
distribution as the prior distribution which "lets the data speak most" in
predicting  the  behavior  of  a  future  observation  which  is  similar  in  nature  to 
the  present  data.  Such  a  prior  distribution  is  obtained  by  keeping  the 
simultaneous  distribution  of  the  present  and  future  observations  as  far  away 
as  possible  from  the  state  of  independence.  The  deviation  from  the 
independence  is  measured  by  the  Kullback-Leibler  information  number. 

Our analysis shows that the uniform distribution is a reasonable choice
only when the possible data distributions do not show significant overlap.
This is the situation where the likelihoods can clearly discriminate the
hypotheses, a situation where the Bayesian modeling is practically
unnecessary. Numerical results show that when the overlap of the data
distributions becomes significant the optimal choice of the prior distribution
depends critically on the mutual relation of the data distributions. These
numerical examples constitute the first examples of determination of
non-trivial non-informative prior distributions over finite possibilities. A
newly obtained information inequality assures the validity of numerically
obtained minimum information prior distributions.

Much remains to be done on the theoretical analysis of the minimum
information prior distribution. However, the numerical examples clearly show
that the concept may find direct applications in practical problems where the
data distributions can be lumped into a finite number of possibilities.
Comparison  of  the  present  definition  with  other  similar  definitions  is  briefly 
discussed  in  the  final  section. 


2.  DEFINITION  OF  THE  MINIMUM  INFORMATION  PRIOR  DISTRIBUTION 

Consider a set of data distributions {f_k(·)} (k = 1,2,...,K). The
simultaneous distribution of the present and future observations x and y
is defined by

    p(y,x) = \sum_{k=1}^{K} f_k(y) f_k(x) w_k,

where w_k denotes the prior probability of the kth distribution f_k(·).

The deviation of this simultaneous distribution from the state of independence
is measured by the Kullback-Leibler information (Kullback and Leibler, 1951)

    I(w) = \int\int p(y,x) \log \frac{p(y,x)}{p(y) p(x)} \, dy \, dx,

where p(·) = Σ_k f_k(·) w_k.

The quantity I(w) is non-negative and becomes zero when
p(y,x) = p(y)p(x). In this case we have p(y|x) = p(y), where p(y|x)
denotes the probability density of y conditional on x, and the structure
defined by {f_k(y)f_k(x)w_k} does not allow any transmission of information
from the present observation x to the expected behavior of the future
observation y. This represents the situation where all the relevant
information about y is represented by {f_k(y)} and {w_k}. Since the
specification of the prior distribution w = {w_k} has to be done before the
observation of x, the above specification of w is acceptable only when we
have complete information on the behavior of y.
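
When the observations are discrete the integrals in I(w) become sums and the criterion can be evaluated directly. The following is a minimal Python sketch under that assumption. The function names `binom_pmf` and `info`, and the binomial family with N = 20 and p = 0.25, 0.5, 0.75, are illustrative choices rather than part of the definition. The sketch evaluates I(w) at the uniform prior and at a prior concentrated on a single distribution, for which p(y,x) = p(y)p(x).

```python
from math import comb, log

def binom_pmf(n, p):
    """Binomial(n, p) probabilities for x = 0, ..., n."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

def info(fs, w):
    """I(w) = sum over (y, x) of p(y,x) * log[p(y,x) / (p(y) p(x))]."""
    m = len(fs[0])
    # Marginal p(x) = sum_k f_k(x) w_k (the same function serves for y).
    marg = [sum(wk * f[x] for wk, f in zip(w, fs)) for x in range(m)]
    total = 0.0
    for y in range(m):
        for x in range(m):
            joint = sum(wk * f[y] * f[x] for wk, f in zip(w, fs))
            if joint > 0.0:  # cells with zero probability contribute nothing
                total += joint * log(joint / (marg[y] * marg[x]))
    return total

# Illustrative family of three binomial data distributions.
fs = [binom_pmf(20, p) for p in (0.25, 0.5, 0.75)]

i_uniform = info(fs, [1 / 3, 1 / 3, 1 / 3])
i_point = info(fs, [1.0, 0.0, 0.0])  # prior concentrated on f_1

print("I(uniform)    =", i_uniform)
print("I(point mass) =", i_point)
```

With the prior concentrated on one distribution the joint distribution factorizes, so the second value is the zero of I(w) described above, while any prior that mixes distinguishable distributions gives a positive value.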

When we are not confident in uniquely specifying a prior distribution we
may consider a set of possible w's. However, this necessitates the
introduction of a prior distribution over the possible prior distributions and
eventually leads to the infinite digression of searching for prior
distributions of prior distributions. One strategy to stop this digression is to
introduce a prior distribution which is least prejudiced against every
possibility. The prior distribution discussed in the preceding paragraph, for
which p(y,x) = p(y)p(x) holds, can be considered as maximally prejudiced, or
informative, in the sense that no further observation of x can influence
the inference of y. If this interpretation is accepted then it is natural to
consider the prior distribution with the corresponding probability
distribution p(y,x) furthest away from p(y)p(x) as the least informative. This
observation leads to the definition of the minimum information prior
distribution: we call a prior distribution {w_k} the minimum information prior
distribution, with respect to {f_k(·)}, when it gives the maximum of I(w).
In the rest of the paper, unless stated otherwise, it is tacitly assumed that
the data distributions f_k(x) are mutually absolutely continuous.

3.  SOME  ANALYSIS  OF  I(w) 

The basic criterion I(w) can be represented as

    I(w) = Shannon entropy of p_w(y)p_w(x) - Shannon entropy of p_w(y,x),

where p_w(x) and p_w(y,x) respectively denote p(x) and p(y,x) defined by
the prior distribution w, and the Shannon entropy of a probability
distribution p(z) is defined by -∫ p(z) log p(z) dz. For the purpose of
comparison of distributions the Shannon entropy may be considered as a measure
of deviation from the uniform distribution. Thus the above representation of
I(w) shows that the minimum information prior distribution that maximizes
I(w) will maximize the dependence between x and y, keeping the marginal
distribution p_w(x) as close to the uniform distribution as possible.

In the exceptional situation where the data distributions are completely
separated, i.e., f_k(x)f_j(x) = 0 for k ≠ j, I(w) reduces to
-Σ_k w_k log w_k, the Shannon entropy of the prior distribution w. This is
maximized at w_k = 1/K. This shows that when the data distributions are well
separated the uniform prior distribution will provide a good approximation to
the minimum information prior distribution.
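
The separated case can be checked on a small discrete example. The sketch below is illustrative only: the three distributions with disjoint supports, the two priors, and the helper names `info` and `entropy` are assumptions made for the demonstration. It compares I(w) with the Shannon entropy of w for a uniform and a non-uniform prior.

```python
from math import log

def info(fs, w):
    """I(w) = sum over (y, x) of p(y,x) * log[p(y,x) / (p(y) p(x))]."""
    m = len(fs[0])
    marg = [sum(wk * f[x] for wk, f in zip(w, fs)) for x in range(m)]
    total = 0.0
    for y in range(m):
        for x in range(m):
            joint = sum(wk * f[y] * f[x] for wk, f in zip(w, fs))
            if joint > 0.0:  # skip cells outside every common support
                total += joint * log(joint / (marg[y] * marg[x]))
    return total

def entropy(w):
    """Shannon entropy -sum_k w_k log w_k of a prior distribution."""
    return -sum(wk * log(wk) for wk in w if wk > 0.0)

# Three hypothetical distributions on {0,...,5} with disjoint supports,
# so that f_k(x) f_j(x) = 0 whenever k != j.
fs = [
    [0.3, 0.7, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.1],
]

uniform = [1 / 3, 1 / 3, 1 / 3]
skewed = [0.6, 0.3, 0.1]

i_uniform, h_uniform = info(fs, uniform), entropy(uniform)
i_skewed, h_skewed = info(fs, skewed), entropy(skewed)

print("uniform:", i_uniform, h_uniform)
print("skewed: ", i_skewed, h_skewed)
```

In this separated setting the two columns agree for each prior, and the uniform prior gives the larger value, log K.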

When some of the data distributions show significant overlap we can
expect that the solution will no longer be close to the uniform
distribution. Since no single w_k can come close to 1, as this will
minimize I(w), we can further expect that some w_k's will be forced to go
down to zero and a distribution in a lower dimensional space of w will
appear  as  the  solution.  The  numerical  examples  of  the  next  section  show  the 
validity  of  these  expectations. 

If the concavity of I(w) is shown, that will assure the validity of the
minimum information prior distribution obtained by a numerical procedure based
on a local search for the maximum of I(w). Consider a prior distribution
w = αu + (1-α)v defined by a pair of prior distributions u and v and
α (0 ≤ α ≤ 1). Denote I(w) by I(α). The concavity of I(w) for general
w holds if it holds that

    I(1) \le I(0) + \left. \frac{dI(\alpha)}{d\alpha} \right|_{\alpha = 0}

for any pair of u and v. This inequality reduces to


    \int\int p_u(y,x) \log \frac{p_u(y,x)}{p_u(y) p_u(x)} \, dy \, dx
        \le \int\int p_u(y,x) \log \frac{p_v(y,x)}{p_v(y) p_v(x)} \, dy \, dx,

which is equivalent to

    I(p_u, p_v) \le I(p_u p_u, p_v p_v),

where I(q,p) = ∫∫ q(y,x) log(q(y,x)/p(y,x)) dy dx and p_u p_u(y,x) denotes
p_u(y)p_u(x).
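
The reduction can be examined numerically for discrete observations. The sketch below assumes that the concavity condition is the tangent-line inequality I(u) ≤ I(v) + dI(α)/dα at α = 0 for w = αu + (1-α)v. Under that assumption it compares the tangent-line gap, obtained with a forward difference, with the difference I(p_u p_u, p_v p_v) - I(p_u, p_v). The binomial family, the priors u and v, the step h and all function names are illustrative assumptions. The sketch only checks that the two gaps coincide; it does not presuppose the sign of either.

```python
from math import comb, log

def binom_pmf(n, p):
    """Binomial(n, p) probabilities for x = 0, ..., n."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

def mix(fs, w):
    """Joint p_w(y,x) and marginal p_w(x) for the prior w."""
    m = len(fs[0])
    marg = [sum(wk * f[x] for wk, f in zip(w, fs)) for x in range(m)]
    joint = [[sum(wk * f[y] * f[x] for wk, f in zip(w, fs)) for x in range(m)]
             for y in range(m)]
    return joint, marg

def info(fs, w):
    """I(w) for strictly positive joint probabilities."""
    joint, marg = mix(fs, w)
    m = len(marg)
    return sum(joint[y][x] * log(joint[y][x] / (marg[y] * marg[x]))
               for y in range(m) for x in range(m))

def kl_joint(fs, u, v):
    """I(p_u, p_v): divergence between the two joint distributions."""
    ju, _ = mix(fs, u)
    jv, _ = mix(fs, v)
    m = len(ju)
    return sum(ju[y][x] * log(ju[y][x] / jv[y][x])
               for y in range(m) for x in range(m))

def kl_product(fs, u, v):
    """I(p_u p_u, p_v p_v): the same expectation with products of marginals."""
    ju, mu = mix(fs, u)
    _, mv = mix(fs, v)
    m = len(mu)
    return sum(ju[y][x] * log(mu[y] * mu[x] / (mv[y] * mv[x]))
               for y in range(m) for x in range(m))

fs = [binom_pmf(10, p) for p in (0.25, 0.5, 0.75)]
u = [0.6, 0.3, 0.1]
v = [0.2, 0.3, 0.5]

h = 1e-6  # forward-difference step for dI/d(alpha) at alpha = 0
w_h = [h * uk + (1 - h) * vk for uk, vk in zip(u, v)]
slope = (info(fs, w_h) - info(fs, v)) / h

tangent_gap = info(fs, v) + slope - info(fs, u)
kl_gap = kl_product(fs, u, v) - kl_joint(fs, u, v)

print("tangent-line gap:", tangent_gap)
print("divergence gap:  ", kl_gap)
```

The agreement of the two printed numbers, up to the finite-difference error, reflects the equivalence of the two forms of the inequality; the sign of the gap for a given pair u, v is what the inequality itself is about.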

This last inequality is an information inequality that shows that
p_v(y)p_v(x) is more sensitive to the variation of v than p_v(y,x),
i.e., an observation from p_v(y)p_v(x) is more informative about v than that
from p_v(y,x). To prove the inequality we consider the minimum of
I(qq,pp) = ∫∫ q(y,x) log{q(y)q(x)/(p(y)p(x))} dy dx for a given p(y,x),
under the condition I(q,p) = θ, a positive constant. Here q(y,x) and
p(y,x) denote arbitrary symmetric probability density functions with respect
to the measure dy dx and q(·) and p(·) denote corresponding marginal
distributions. The minimization leads to the variational analysis of

    R(q) = I(qq,pp) + \lambda (I(q,p) - \theta)
           + \mu \left( \int\int q(y,x) \, dy \, dx - 1 \right),

where λ and μ are Lagrange multipliers. By considering a small
perturbation r(y,x) (= r(x,y)) of q(y,x) it can be seen that the stationary
solution must satisfy the relation

    \int\int r(y,x) \left[ \log \frac{q(y) q(x)}{p(y) p(x)}
        + \lambda \log \frac{q(y,x)}{p(y,x)} \right] dy \, dx = 0.

This shows that we have an equality
log(q(y,x)/p(y,x)) = C log{q(y)q(x)/(p(y)p(x))} and accordingly I(q,p) = C
I(qq,pp), where C = -1/λ > 0. Due to the convexity of I(qq,pp) with
respect to q the stationary solution gives the minimum of I(qq,pp) under
the given constraints.

Since we have

    \int\int q(y,x) \, dy \, dx
        = \int\int \left\{ \frac{q(y) q(x)}{p(y) p(x)} \right\}^{C} p(y,x) \, dy \, dx,

C must be equal to or less than 1, if q(y)/p(y) and q(x)/p(x) are
positively correlated under p(y,x). In this case I(q,p) ≤ I(qq,pp) holds
for any q. For the particular choice p(y,x) = p_v(y,x) it can easily be
seen that the positivity of the correlation holds for any symmetric q(y,x).
This completes the proof of the information inequality.




4.  NUMERICAL  INVESTIGATION 

For the simplicity of numerical analysis we consider the case where the
variables x and y take only integral values 0, 1, 2, ..., L. The quantities
useful for the numerical maximization of I(w) are

    I(w) = \sum_{y} \sum_{x} p_w(y,x) s(y,x),

    \frac{\partial I(w)}{\partial w_k} = \sum_{y} \sum_{x} Dff(k,y,x) s(y,x),

    \frac{\partial^2 I(w)}{\partial w_j \partial w_k}
        = \sum_{y} \sum_{x} \frac{Dff(j,y,x) Dff(k,y,x)}{p_w(y,x)}
          - 2 \sum_{x} \frac{Df(j,x) Df(k,x)}{p_w(x)},

where s(y,x) = log{p_w(y,x)/(p_w(y)p_w(x))}, Dff(k,y,x) = f_k(y)f_k(x) -
f_K(y)f_K(x) and Df(k,x) = f_k(x) - f_K(x) (= Σ_y Dff(k,y,x)). To apply the
ordinary optimization procedure I(w) is maximized with respect to
w_1, w_2, ..., w_{K-1}, whereas w_K is given by w_K = 1 - w_1 - ... - w_{K-1}.
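
The gradient expression can be checked against a finite-difference approximation. The sketch below assumes the first-derivative formula is the double sum of Dff(k,y,x) s(y,x), with w_K eliminated through the constraint. The binomial family (N = 10), the test point w, the step h and the function names are illustrative assumptions.

```python
from math import comb, log

def binom_pmf(n, p):
    """Binomial(n, p) probabilities for x = 0, ..., n."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

def info(fs, w):
    """I(w) = sum over (y, x) of p_w(y,x) * s(y,x)."""
    m = len(fs[0])
    marg = [sum(wk * f[x] for wk, f in zip(w, fs)) for x in range(m)]
    total = 0.0
    for y in range(m):
        for x in range(m):
            joint = sum(wk * f[y] * f[x] for wk, f in zip(w, fs))
            total += joint * log(joint / (marg[y] * marg[x]))
    return total

def gradient(fs, w):
    """dI/dw_k for k = 1, ..., K-1, with w_K = 1 - w_1 - ... - w_{K-1}."""
    m, big_k = len(fs[0]), len(fs)
    marg = [sum(wk * f[x] for wk, f in zip(w, fs)) for x in range(m)]
    grad = [0.0] * (big_k - 1)
    last = fs[-1]
    for y in range(m):
        for x in range(m):
            joint = sum(wk * f[y] * f[x] for wk, f in zip(w, fs))
            s = log(joint / (marg[y] * marg[x]))
            for k in range(big_k - 1):
                # Dff(k,y,x) = f_k(y) f_k(x) - f_K(y) f_K(x)
                dff = fs[k][y] * fs[k][x] - last[y] * last[x]
                grad[k] += dff * s
    return grad

fs = [binom_pmf(10, p) for p in (0.25, 0.5, 0.75)]
w = [0.3, 0.3, 0.4]
analytic = gradient(fs, w)

h = 1e-5
numeric = []
for k in range(len(w) - 1):
    up, down = list(w), list(w)
    up[k] += h
    up[-1] -= h      # keep the weights summing to one
    down[k] -= h
    down[-1] += h
    numeric.append((info(fs, up) - info(fs, down)) / (2 * h))

print("analytic:", analytic)
print("numeric: ", numeric)
```

A central difference is used here only as an independent check; an optimization routine would normally call the analytic form.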
As a typical set of data distributions {f_k(·)} we adopted a set of
binomial distributions

    f_k(x) = {}_N C_x \, p_k^{x} (1 - p_k)^{N-x},

where N and p_k (k = 1,2,...,K) were properly chosen for each particular
example. The uniform distribution w_k = 1/K was used as the initial guess to
start the numerical optimization. An ordinary unconstrained numerical
optimization procedure was applied with a minor modification to satisfy the
non-negativity constraint w_k ≥ 0. For the examples to be discussed in the
following the absolute values of the gradients at the solutions were at most
of the order 10^{-6}, except for those w_k's which were zero where the
gradients took significant negative values.
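
The optimization routine used for the reported results is not specified in detail. As a rough stand-in, the sketch below applies a multiplicative (exponentiated-gradient) update with step halving. This is a substitute technique chosen because it keeps the weights non-negative and normalized automatically; it is not the procedure described above. The parameters (N = 20 and p = 0.25, 0.5, 0.75), the step rule, the iteration count and the function names are all assumptions.

```python
from math import comb, exp, log

def binom_pmf(n, p):
    """Binomial(n, p) probabilities for x = 0, ..., n."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

def info_and_scores(fs, w):
    """Return I(w) and g_k = sum over (y, x) of f_k(y) f_k(x) s(y,x)."""
    m = len(fs[0])
    marg = [sum(wk * f[x] for wk, f in zip(w, fs)) for x in range(m)]
    total = 0.0
    g = [0.0] * len(fs)
    for y in range(m):
        for x in range(m):
            joint = sum(wk * f[y] * f[x] for wk, f in zip(w, fs))
            if joint <= 0.0:
                continue
            s = log(joint / (marg[y] * marg[x]))
            total += joint * s
            for k, f in enumerate(fs):
                g[k] += f[y] * f[x] * s
    return total, g

def maximize(fs, steps=300):
    """Monotone multiplicative ascent on the simplex, started at the uniform prior."""
    big_k = len(fs)
    w = [1.0 / big_k] * big_k
    value, g = info_and_scores(fs, w)
    eta = 1.0
    for _ in range(steps):
        g_max = max(g)  # subtracting the maximum keeps exp() well scaled
        raw = [wk * exp(eta * (gk - g_max)) for wk, gk in zip(w, g)]
        z = sum(raw)
        cand = [r / z for r in raw]
        cand_value, cand_g = info_and_scores(fs, cand)
        if cand_value >= value:   # accept only steps that do not decrease I(w)
            w, value, g = cand, cand_value, cand_g
        else:
            eta *= 0.5            # otherwise shrink the step and retry
    return w, value

fs = [binom_pmf(20, p) for p in (0.25, 0.5, 0.75)]
i_start, _ = info_and_scores(fs, [1 / 3, 1 / 3, 1 / 3])
w_opt, i_opt = maximize(fs)

print("I at uniform prior:", i_start)
print("prior found:       ", [round(wk, 3) for wk in w_opt])
print("I at prior found:  ", i_opt)
```

Because only non-decreasing steps are accepted, the value returned is never below the value at the uniform start. Such a local search can stop at a local maximum, which is why the concavity question of the preceding section matters.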

The  first  example  was  designed  to  see  the  effect  of  relative  location  of 
the  data  distributions  on  the  determination  of  the  minimum  information  prior 
distribution.  Three  sets  of  data  distributions  were  considered,  each  composed 
of three data distributions, i.e., K = 3. These were defined respectively by
(p_1 = 0.1, p_2 = 0.5, p_3 = 0.9), (p_1 = 0.2, p_2 = 0.5, p_3 = 0.8) and
(p_1 = 0.3, p_2 = 0.5, p_3 = 0.7). The parameter N of the binomial
distribution was put equal to 20. The minimum information prior distributions
obtained numerically are given in Table 1 along with the corresponding p_k's.
The numbers were rounded at the fourth decimal place.


Table  1.  Effect  Of  Relative  Location 


The result of Table 1 shows that as the three data distributions come
closer to each other the distribution at the center loses its prior
probability. One might expect that if the data distributions are brought
still closer then eventually the prior probability will concentrate on the
distribution at the center. This does not happen for this example with
K = 3. However, that type of behavior is observed locally in the example to be
discussed after the next, where K = 5.

The  second  example  was  designed  to  check  the  effect  of  increased 
dispersions of the data distributions. With K = 3 the p_k's used to define
the binomial distributions were p_1 = 0.25, p_2 = 0.5 and p_3 = 0.75. To get
distributions with successively increasing dispersions N was put equal to
80, 40, 30 and 20. The corresponding minimum information prior distributions
are given in Table 2 along with the p_k's.


Table 2. Effect of Increased Dispersions (K = 3)

  N       80      40      30      20      p_k

  w_1    .340    .373    .410    .500    0.25
  w_2    .321    .255    .179    .000    0.5
  w_3    .340    .373    .410    .500    0.75

It  can  be  seen  that  as  N  is  decreased,  i.e.,  as  the  overlap  of  the  data 
distributions  is  increased,  the  minimum  information  prior  distribution 
deviates  from  the  uniform  distribution  over  the  three  data  distributions  to 
the  one  over  the  two  end  distributions,  just  as  in  the  case  of  the  first 
example. 

The third example was chosen to illustrate further the complexity of the
possible shape of the minimum information prior distribution for an
increased K, the number of possible data distributions. In this example
K was put equal to 5 and the p_k's were p_1 = 0.1, p_2 = 0.325, p_3 = 0.5,
p_4 = 0.675, p_5 = 0.9. The value of N was successively put equal to 70, 60,
50, 40, 30, 25, 20, 15, 10, and 5. The corresponding minimum information prior
distributions are given in Table 3 along with the p_k's.


Table 3. Effect of Increased Dispersions (K = 5)

  N       70      60      50      40      30      25      20      15      10      5       p_k

  w_1    .245    .253    .256    .262    .276    .289    .347    .361    .402    .500    .1
  w_2    .196    .200    .244    .238    .224    .211    .000    .000    .000    .000    .325
  w_3    .117    .094    .000    .000    .000    .000    .307    .278    .195    .000    .5
  w_4    .196    .200    .244    .238    .224    .211    .000    .000    .000    .000    .675
  w_5    .245    .253    .256    .262    .276    .289    .347    .361    .402    .500    .9

The  result  of  Table  3  clearly  suggests  that  some  clustering  of  data 
distributions  is  required  when  there  is  significant  overlap  among  the 
distributions.

The fourth and last example was designed to see the effect of the
difference of dispersions among the data distributions. Only two data
distributions were considered. The result is given in Table 4. It can be
seen that the data distributions defined with p_1 = .5, which have larger
variances than those defined with p_2 = .9, are receiving lower prior
probabilities. Due to the relatively good separations of the data
distributions the differences of the prior probabilities are rather small.
Table 4. Effect of the Difference of Dispersions

  N       20      15      10      5       2       p_k

  w_1    .497    .494    .488    .471    .439    .5
  w_2    .503    .506    .512    .529    .561    .9


5.  DISCUSSION 

The definition of the minimum information prior distribution introduced
in this paper is based on two principles. The first is to specify the purpose
of the inference based on the present data as the prediction of another
similar future observation. The second is to evaluate the deviation of
p(y,x) from p(y)p(x) by the Kullback-Leibler information I(w). For the
discussion of the adequacy of the Kullback-Leibler information number as such
a criterion, see, for example, Akaike (1982). Once the above two principles are
accepted the definition of the minimum information prior distribution follows
quite naturally.

Contrary to the usual preconception of the uniform distribution as the
non-informative prior distribution for a finite set of possible data
distributions, the numerical results have shown the necessity of careful
analysis of the mutual relation among the data distributions. At least in
principle  the  present  analysis  can  be  extended  to  more  complex  situations,  if 
only  the  necessary  numerical  procedure  is  properly  developed. 

If we followed Lindley (1956) we could have defined the minimum
information prior distribution as that {w_k} which maximizes

    I_0(w) = \sum_{k} w_k \int f_k(x) \log \frac{f_k(x)}{p(x)} \, dx.

Such a prior distribution may be characterized as the one that keeps the
probability distribution f_k(x)w_k over (x,k) as far away as possible from
the state of independence defined by p(x)w_k. Since we have the relation

    I_0(w) = \int p(x) \left\{ \sum_{k} p(k|x) \log \frac{p(k|x)}{w_k} \right\} dx,

where p(k|x) = f_k(x)w_k/p(x), the prior distribution that maximizes I_0(w)
may also be characterized as the one that produces maximum expected change in
the transition from {w_k} to {p(k|x)}.
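
The two characterizations can be compared numerically for discrete observations. The sketch below assumes that I_0(w) is the expected information Σ_k w_k Σ_x f_k(x) log(f_k(x)/p(x)) and that p(k|x) = f_k(x)w_k/p(x). The binomial family, the prior w and the function names are illustrative assumptions.

```python
from math import comb, log

def binom_pmf(n, p):
    """Binomial(n, p) probabilities for x = 0, ..., n."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

def marginal(fs, w):
    """p(x) = sum_k f_k(x) w_k."""
    return [sum(wk * f[x] for wk, f in zip(w, fs)) for x in range(len(fs[0]))]

def lindley_info(fs, w):
    """I_0(w) = sum_k w_k sum_x f_k(x) log(f_k(x) / p(x))."""
    px = marginal(fs, w)
    return sum(wk * f[x] * log(f[x] / px[x])
               for wk, f in zip(w, fs)
               for x in range(len(px))
               if wk > 0.0 and f[x] > 0.0)

def posterior_form(fs, w):
    """sum_x p(x) sum_k p(k|x) log(p(k|x) / w_k) with p(k|x) = f_k(x) w_k / p(x)."""
    px = marginal(fs, w)
    total = 0.0
    for x in range(len(px)):
        for wk, f in zip(w, fs):
            post = f[x] * wk / px[x]
            if post > 0.0:
                total += px[x] * post * log(post / wk)
    return total

fs = [binom_pmf(20, p) for p in (0.25, 0.5, 0.75)]
w = [0.2, 0.3, 0.5]

first = lindley_info(fs, w)
second = posterior_form(fs, w)
prior_entropy = -sum(wk * log(wk) for wk in w)

print("sum over k form:   ", first)
print("posterior form:    ", second)
print("entropy of w bound:", prior_entropy)
```

The two forms are the same double sum over (x,k) arranged in different orders, so they agree up to rounding; the entropy of w is printed as the familiar upper bound on this quantity.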

This  definition  leads  to  a  numerical  optimization  problem  which  is 
simpler  than  that  of  our  definition.  The  result  corresponding  to  Table  3  is 
given in Table 5 for this definition. The computations for the cases N = 40
and 30 were omitted. By comparing Table 5 with Table 3 we can see that the
definition based on I_0(w) leads to a prior distribution which is closer to the
uniform distribution than that by our definition. This shows that the
predictive point of view demands more adaptive choice of the prior
distribution.


Table 5. Prior Distributions Maximizing I_0(w)

  N       70      60      50      40      30      25      20      15      10      5       p_k

  w_1    .226    .232    .239     -       -      .270    .284    .310    .363    .424    .1
  w_2    .194    .193    .192     -       -      .178    .162    .117    .000    .000    .325
  w_3    .158    .149    .138     -       -      .103    .108    .147    .275    .151    .5
  w_4    .194    .193    .192     -       -      .178    .162    .117    .000    .000    .675
  w_5    .226    .232    .239     -       -      .270    .284    .310    .363    .424    .9


The maximal data information prior distribution introduced by Zellner
(1977) is based on a modification of I_0(w) to avoid the analytical
difficulty in handling I_0(w). The criterion is based on somewhat formal use
of  the  Shannon  entropy  and  its  technical  meaning  is  rather  unclear,  unless  we 
accept  the  Shannon  entropy  literally  as  a  representation  of  the  amount  of 
information.  The  reference  prior  distribution  introduced  by  Bernardo  (1979) 
is  somewhat  similar  to  our  minimum  information  prior  distribution.  However, 
it  is  based  on  the  concept  of  infinitely  repeated  observation  of  x,  instead 
of  the  one  single  observation  in  our  definition,  and  inevitably  leads  to  the 
uniform  prior  distribution  when  the  number  of  possible  data  distributions  is 
finite. 

Since  statistics  is  developed  to  handle  problems  in  the  real  world,  no 
procedure  can  claim  its  superiority  to  others  unless  it  is  tested  with  real 
applications.  In  that  sense  much  remains  to  be  done  to  clarify  the  practical 
implication  of  the  minimum  information  prior  distribution.  Nevertheless,  the 
clarity  of  its  technical  meaning  and  the  reasonable  behavior  of  the  numerical 
examples  suggest  the  potential  of  the  minimum  information  prior  as  a 
conceptual  resort  in  terminating  the  notorious  indefinite  digression  in 
Bayesian  modeling. 